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Abstract 

This  paper  describes  four  APL-like  primitives  for  manipulating  dense  matrices  and  vectors 
and  describes  their  implementation  on  the  Connection  Machine  hypercube  multiprocessor. 
These  primitives  provide  a  natural  way  of  specifying  parallel  matrix  algorithms 
independently  of  machine  size  or  architecture  and  can  actually  enhance  efficiency  by 
facilitating  automatic  load  balancing.  We  illustrate  their  use  in  three  numerical  algorithms: 
a  vector-matrbc  multiply,  a  Gaussian-elimination  routine  and  a  simplex  algorithm  for  linear 
programming.  We  describe  implementations  of  the  primitives  assuming  load-balanced 
embeddings  of  matrices  and  vectors  on  a  hypercube  multiprocessor  architecture.  The 
primitives  may  indicate  a  change  from  one  embedding  to  another.  The  implementations 
are  efficient  in  the  frequently  occurring  case  where  there  are  fewer  processors  than  matrix 
elements.  In  particular  if  there  are  m  >  pig  p  matrix  elements,  where  is  the  number  of 
processors,  then  the  implementations  of  some  of  the  primitives  are  asymptotically  optimal 
in  that  the  processor-time  product  is  no  more  than  a  constant  factor  bdgher  than  the 
running  time  of  the  best  serial  algorithm.  Furthermore,  the  paraUel  time  required  is 
optimal  to  within  a  constant  factor. 

We  have  implemented  the  primitives  on  the  Connection  Machine  System,  and  this 
implementation  improved  the  performance  of  the  simplex  implementation  by  almost  an 
order  of  magnitude  over  a  naive  implementation,  from  S2  Mflops  to  SOO  Mflops.  We  give 
Connection  Machine  timings  for  the  primitives  and  the  algorithms. 
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Abstract 


1  Introduction 


This  paper  describei  four  APL-like  primitive*  for  ma¬ 
nipulating  dense  matrices  and  vectors  and  describe* 
their  implementation  on  the  Connection  Machine*  hy¬ 
percube  multiprocessor.  These  primitives  provide  a  nat¬ 
ural  way  of  specifying  parallel  matrix  algorithm*  inde¬ 
pendently  of  machine  size  or  architecture  and  can  ac¬ 
tually  enhance  efficiency  by  facilitating  automatic  load 
balancing.  We  illustrate  their  use  in  three  numeri¬ 
cal  algorithms;  a  vector-matrix  multiply,  a  Gaussian- 
elimination  routine  and  a  simplex  algorithm  for  lin¬ 
ear  programming.  We  describe  implementations  of  the 
primitives  assuming  load-balanced  embeddings  of  ma¬ 
trices  and  vectors  on  a  hypercube  multiprocessor  ar¬ 
chitecture.  The  primitives  may  indicate  a  change  from 
one  embedding  to  another.  The  implementations  are 
efficient  in  the  frequently  occurring  case  where  there 
are  fewer  processors  than  matrix  elements.  In  partic¬ 
ular  if  there  are  m  >  plgp  matrix  elements,  where  p 
is  the  number  of  processors,  then  the  implementations 
of  some  of  the  primitive*  are  asymptotically  optimal  in 
that  the  processor-time  product  is  no  more  than  a  con¬ 
stant  factor  higher  than  the  running  time  of  the  best  se¬ 
rial  algorithm.  Furthermore,  the  parallel  time  required 
is  optimal  to  within  a  constant  factor. 

We  have  implemented  the  primitive*  on  the  Con¬ 
nection  Machine  System,  and  this  implementation  im¬ 
proved  the  performance  of  the  simplex  implementation 
by  almost  an  order  of  magnitude  over  a  naive  imple¬ 
mentation,  from  52  Mflops  to  500  Mflops.  We  give 
Connection  Machine  timings  for  the  primitives  and  the 

algorithms.  _ 
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When  imi^ementing  a  parallel  algorithm,  it  is  conve¬ 
nient  to  have  a  high-level  parallel  language  which  pro¬ 
vides  the  convenience  one  hss  come  to  expect  ficom  well- 
established  aerial  languages.  One  wishes  to  concentrate 
on  the  details  of  the  algorithm,  allowing  the  language 
to  abstract  away  details  of  the  machine  including  the 
number  of  processors,  the  interconnection  network,  and 
the  embedding  of  data  elements  into  processors.  How¬ 
ever,  since  parallel  machines  are  currently  very  expen¬ 
sive  and  used  for  huge,  computationally  intensive  ap- 
plications,  nsers  often  will  not  give  up  performance  for 
ease  of  programming  and  portability.  Techniques  for 
mapping  high-level  descriptions  of  algorithms  onto  effi¬ 
cient  code  for  varions  parallel  machines  have  therefore 
become  very  important  and  will  probably  consume  a 
large  portion  of  computer  science  research  in  the  next 
decade. 

This  paper  provides  such  a  technique  by  showing 
how  very  high-level  descriptions  of  a  broad  class  of  dense 
matrix  algorithms  can  be  mapped  onto  a  real  machine, 
the  Connection  Machine,  with  no  performance  loss  over 
hand  coded  versions.  It  presents  4  high-level  APL-like 
primitives  and  illustrates  code  for  a  vector-matrix  mul¬ 
tiply,  a  Gaussiaa-elimination  routine  and  a  simplex  al¬ 
gorithm  for  linear  programming  baaed  on  these  primi¬ 
tives.  The  algorithm  are  straight-forward  implementa¬ 
tions  of  the  rlsBsir  algorithms.  The  code  is  high-level, 
it  works  for  any  siied  matrices^,  and  contains  no  infor¬ 
mation  on  how  data  is  mapped  onto  processors  nor  on 
how  the  data  should  be  communicated.  Therefore  it  is 
concise:  none  of  our  rontinen  contain  more  than  20  lines 
of  code.  The  paper  then  discusses  our  implementation 
of  the  primitives  on  the  Connection  Machine  and  gives 
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Scalu  Instructions; 

Global  Instructions: 

Broadcast,  g-min,  g-max,  . . . 

Elementwise  Vector  and  Matrix  Instructions: 

P-+.  P— .  P-*i  P"r.  ■  •  ■ 

Vector-Matrix  Instructions: 

insert,  extract,  distribute,  reduce 


Figure  1:  The  instructions  we  use  in  the  algorithms  de¬ 
scribed  in  this  paper.  <jlo6aI  instructions  compute  a  single 
value  from  all  elements  of  a  vector  or  matrix  or  broadcast 
a  single  value  across  a  vector  or  matrix.  Elemenlwite 
instructions  perform  an  operation  elementwise  over  corre¬ 
sponding  elements  of  equal  sized  matrices  or  vectors. 


actual  timings  for  both  the  primitives  and  two  of  the 
algorithms.  The  simplex  and  vector-matrix  multiply 
execute  faster  than  any  other  code  for  these  applica¬ 
tions  for  the  Connection  Machine.  With  256  matrix 
elements  per  processor,  an  iteration  of  simplex  runs  at 
500  Mflops  and  Matrix- Vector  multiply  runs  at  over  1 
Gflop  on  a  64K  CM-2  (single  precision).  Versions  of  the 
primitives  we  implemented  are  now  included  in  PARIS, 
the  parallel  instruction  set  of  the  Connection  Machine. 

The  four  primitives  we  consider  extract  a  vector  from 
a  row  or  column  of  a  matrix,  iruert  a  vector  into  a  row  or 
column,  distribute  a  vector  across  the  tows  or  columns, 
and  reduce  the  rows  or  columns  into  a  vector  using  a 
binary  associative  operator  such  as  -f,  maximum,  or 
minimum.  It  is  convenient  to  also  use  an  operation 
which  spreads  a  row  or  column  across  its  matrix  (extract 
followed  by  distribute).  We  also  assume  the  existence 
of  various  other  simple  primitives — these  primitives  are 
summarized  in  Figure  1. 

Although  the  extract  and  insert  primitives  might 
seem  trivial,  their  implementation  is  actually  as  com¬ 
plex  as  the  implementation  of  the  distribute  and  reduce 
primitives.  They  involve  rearranging  the  data  among 
the  processors  by  communicating  along  tr^  embed¬ 
ded  within  subcubes  occupied  by  th<  or  columns 
of  a  matrix.  This  data  transfer  gu«.' optimal 
load  balancing  of  any  row  or  column  %'  extracted 
from  a  matrix.  Thus  after  extraction  eacb  processor 
holds  (within  one)  the  same  number  of  vector  elements 
(=  \mlp'\  where  m  is  the  number  of  matrix  elements 
and  p  is  the  number  of  processors).  Load  balancing  the 
vectors  improved  the  performance  of  the  simplex  algo¬ 
rithm  from  under  lOOMFlops  to  over  500Mflops. 

There  has  been  considerable  research  on  implement¬ 
ing  dense  matrix  algorithms  on  hyperenbe  based  ma¬ 
chines  [14,  9,  12,  7,  3,  4].  This  paper  concentrates  on 
how  to  get  similar  results  without  having  to  code  for  the 
particular  machine.  The  final  execution  of  the  linear 
systems  solver  is  similar  to  the  algorithm  suggested  for 


a  hypercube  by  Johnsson  [12].  Our  implementations  of 
the  primitives  are  simple,  dean  subcases  of  the  n  edge- 
disjoint  spanning  binomial  trees  (NSBT)  algorithnu  due 
to  Johnsson  and  Ho  [13].  Our  extract  implementation 
is  a  version  of  what  they  call  one-to-all  personalized 
communication  within  subcubes  and  our  distribute  im¬ 
plementation  is  an  example  of  all-to-all  broadcasting. 
Other  related  algorithms  for  hyperenbes  are  discussed 
by  Fox  and  Furmasld  [8],  Stout  and  Wager  [17],  and 
Deshpande  and  Jenevin  [6]. 

To  motivate  some  of  the  decisions  we  made  in  the 
selection  and  implementation  of  primitives,  let  ns  ex¬ 
amine,  as  an  example,  the  solution  of  linear  programs 
using  the  Dantsig  simplex  method.  The  standard  form 
of  a  linear  programming  problem  is  as  follows: 

minimize  c^x  such  that  t  T 

X  >  0 

where  c  is  an  ma-dimensional  integer  objective  /unction 
vector,  A  is  an  mi  X  nt}  integer  constraint  matrix,  b  is 
an  mi -vector  of  integers,  and  x  is  a  real  mj-vector  of 
unknowns.  All  information  needed  to  perform  a  step  of 
the  computation  is  kept  in  a  tableau  which  is  initially  the 
constraint  matrix  A  augmented  by  the  column  vector  h 
and  the  row  vector  c.  A  single  step  of  the  computation 
is  then  one  step  of  Gaussian  elimination  on  the  entire 
tableau  where  the  pivot  column  has  the  most  negative 
value  of  vector  e  and  the  pivot  row  has  the  smallest 
positive  valne  of  bfa  within  the  pivot  column.  Section  2 
explains  some  of  the  ideas  behind  thr  method — for  now, 
an  understanding  of  the  functionality  suffices. 

If  we  have  as  many  processors  as  tableau  elements, 
a  straight-forward  implementation  of  one  simplex  step 
can  be  written  as  follows: 

Algorithm  5implex 

'.tableau  T  ((mi  -f  1)  x  (mj  +  1)) 

;mi(ia/lp  matrix  A  augmented  by 
;colnmn  vector  B  and  row  vector  C 

repeat  forever: 

1  selecting  processors  that  initially  held  vector  C 

2  pivotcolmmm  s  column  #  of  processor 

holding  g-min  (7^ 

(if  g-mia(T)  >  0,  exit  Simplex  successfully) 

3  selecting  processors  that  initially  held  vector  B 

4  send  valne  of  T  to  pivot  column 

(store  locally  as  B) 

5  selecting  processors  in 

Pivot  column  with  T  >  0 

(if  none,  exit  Simplex  unsuccessfully) 

6  Ratio  s  p-+(H,  T) 

7  pivotroumum  «  row  #  of  processor 

holding  g-mia(Ratio) 

8  pivotelement  m  A\pivotcelnum]\pivotroumum] 

9  selecting  processors  in  the  pivot  row 

10  r  s  p-!-(T,  BroidcMtt(pivotelement)) 
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Figure  2:  Euch  proceMor  in  m  P  proceMor  hypetcnbe 
viewed  as  a  pi  x  pj  grid  of  processors  holds  a  ibi  x  hj 
submatrix  of  the  mi  x  m3  matrix  (P  =  P1P3,  mi  =  pikt 
and  m3  =  P3ib3)- 


11  send  T  value  of  pivot  row  to  all  rows 

(store  in  Pivotrow) 

12  send  T  value  of  pivot  column  to  all  columns 

(store  in  Pivoted) 

13  T  =  p-—(T,p-»(Pivotrow,  Pivoted)) 

In  general,  however,  we  would  like  to  run  problems 
in  which  the  tableau  is  larger  than  the  number  of  pro¬ 
cessors.  Figure  2  shows  how  we  might  map  an  mt  x  m3 
matrix  onto  a  machine  where  the  proceanon  can  \>e  logi¬ 
cally  configured  as  a  grid.  Each  processor  holds  a  ki  x  ka 
submatrix  which  is  as  square  as  possible  (an  aspect  ra¬ 
tio  of  at  most  2).  The  runtime  environment  can  keep 
track  of  how  a  matrix  is  mapped  onto  the  processors, 
and  can  automatically  loop  over  all  the  elements  in  each 
processor  when  operating  on  the  matrix. 

This  embedding  is  load  balanced  with  respect  to  the 
matrix  since  each  processor  holds  an  equal  number  of 
matrix  elements.  Any  given  row  or  column,  however, 
when  viewed  as  a  vector  is  not  well-balanced  at  all  as 
shown  in  Figure  3.  In  particular,  when  performing  the 
divisions  in  lines  6  and  10  and  the  minimums  in  lines  2 
and  7  in  the  simplex  algorithm  above,  a  small  subset  of 
the  processors  must  perform  many  computations  while 
the  rest  of  the  processors  are  idle.  When  computing 
on  a  single  column,  as  in  line  6,  it  may  be  more  effi¬ 
cient  to  use  the  underlying  interconnection  network  to 
spread  the  work  from  the  one  active  processor  in  each 
row  to  the  idle  processors  in  its  row.  For  example,  in 
computing  on  the  column  selected  in  Figure  3,  the  two 
active  processors  distribute  one  operation  to  each  of  the 
three  idle  processors  in  their  rows  so  that  each  proces¬ 
sor  perforins  one  operation.  Furthermore,  it  may  be 
beneficial  to  always  leave  the  B  and  C  vectors  stored  in 
this  load-balanced  fashion  and  to  update  them  in  place. 
In  our  implementation  on  the  Connection  Machine  Sys¬ 
tem,  a  carefully  coded  implementation  similar  to  the 
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Figure  3:  All  the  dements  of  a  single  column  arc  located 
in  t  small  subset  of  the  processors.  In  the  example  they 
arc  located  in  2  of  the  8  processors.  When  extracting  a 
column,  the  demenu  arc  distributed  across  the  processors 
so  that  each  processor  gets  one  dement  of  the  column. 
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code  above  ran  at  52  Mflops.  After  adding  the  load 
balancing,  the  loop  ran  at  500  Mflops. 

In  seeking  to  generalise  from  this  experience,  we 
found  that  the  fonr  desses  of  primitives  provide  ex¬ 
pressive  power  and  potentially  automatic  effidency  im¬ 
provement.  In  the  simplex  example,  our  inefficiendes 
came  from  treating  vectors  as  matrices.  Distinguishing 
between  vector  and  matrix  computations  is  not  only 
effident,  bnt  also  fits  wdl  into  the  way  programmers 
think  about  computing  on  these  objects. 

Section  2  gives  examples  of  the  use  of  the  primitives 
in  several  appUcations  indnding  the  simplex  method 
for  linear  programming.  The  fonr  daases  of  primitives 
provide  cues  that  computation  is  shifting  from  a  full 
matrix  to  a  vector  rdated  to  that  matrix  (eg.  a  row 
or  oolnmn),  indicating  that  it  may  be  a  proper  time 
for  load  balandng.  Bookkeeping  rdated  to  the  looping 
over  matrix  or  vector  dements  within  a  processor  is  best 
executed  at  runtime  rather  than  compUe  time  because 
the  binary  code  should  be  independent  of  the  number 
of  processors  in  a  machine.  A  compiler,  however,  can 
frequently  dedde  whether  is  is  better  to  load  balance 
during  an  extract  or  reduce  operation  or  to  leave  the 
dements  in  place. 

Section  3  discusses  the  implementations  of  the  prim¬ 
itives  for  hpperetiie  multiprocessors  with  N  =  2"  pro¬ 
cessors.  Each  processor  has  a  unique  address  which  is 
Ig  W  s  n  bits  long  and  each  processor  is  directly  con¬ 
nected  to  all  processors  whose  address  differs  from  iu 
own  in  exactly  one  bit  position.  Two  connected  pro¬ 
cessors  whose  address  differs  in  Ut  •  are  called  ith- 
dimetuional  neighbors.  The  algorithms  we  present  in 
this  paper  work  on  any  hypetcnbe  multiprocessor,  but 
they  are  simple  enon^  to  run  (»  a  data  paraUel  ar- 
ehitaetmre  in  which  each  processor  executes  the  same 
instruction  broadcast  from  a  front  end  computer  (also 
knosm  as  a  single  instruction  multiple  data  architec¬ 
ture). 


Section  4  gives  timings  for  sn  implementstion  of 
our  primitives  on  the  Connection  Machine  System  [10]. 
The  current  implementations  are  restricted  in  two  ways. 
Firstly,  the  number  of  rows  and  columns  in  a  matrix 
must  be  a  power  of  two.  Secondly,  a  vector  extracted 
from  a  row  of  a  matrix  can  only  be  distributed  to  or  de¬ 
posited  in  a  row  of  a  matrix  of  equal  size  as  the  matrix  is 
was  extracted  from.  In  Section  3  we  discuss  techniques 
for  avoiding  the  first  restriction.  The  Connection  Ma¬ 
chine  has  a  router  for  arbitrary  interprocessor  commu¬ 
nication  and  its  processors  have  the  ability  to  indirectly 
address  their  memory;  our  algorithms,  however,  do  not 
require  the  power  of  these  features. 

Remarkably,  although  our  implementations  are  very 
simple,  the  reduce  and  distribute  primitives  are  opti¬ 
mal  for  a  hypercube  in  two  senses;  parallel  time,  and 
processor-time  product  (work),  provided  the  matrix  is 
sufficiently  large  relative  to  the  number  of  processors 
and  each  processor  is  capable  of  sending  only  a  con¬ 
stant  number  of  messages  in  unit  time. 

2  Example  Applications 

In  this  section,  we  present  CM  implementations  of 
matrix-vector  multiplication,  LU  decomposition  with 
partial  pivoting,  solution  of  a  linear  system  from  an  LU 
decomposition  and  a  vector  i,  and  a  simplex  method 
for  linear  programming.  These  have  been  implemented 
using  the  primitives  we  described  and  are  competitive 
with  ail  other  CM  implementations  to  date. 

2.1  Matrix- Vector  Multiply 

The  first  example  we  present  is  a  matrix-vector  multi¬ 
ply.  Using  a  distribute,  we  spread  the  input  vector  over 
all  rows  of  the  matrix,  multiply  the  two  element-wise, 
and  reduce  ail  rows  of  the  result  by  summing  across  the 
rows,  returning  a  column  vector  containing  the  matrix- 
vector  product.  Thus, 


as  we  would  if  we  were  investigating  a  Markov  process. 
At  some  cost  in  storage,  we  cslo  store  both  the  transi¬ 
tion  matrix  and  its  transpose,  then  alternate  which  one 
we  use.  Alternativdy,  we  can  pay  Ig  n  time  to  transpose 
either  the  vector  or  the  matrix. 

Algorithm  Uatrix-Veetor  Multiply  (MVM) 

;matrit  A  (mi  x  m3) 

;pector  B 

•.temporary  matrix  TEMP  (mi  x  m3) 

1  Temp  s  distribute-row(S) 

2  Temp  =  p-*(A,  Temp) 

3  return(-f-rcducc-to-column(Temp)) 


2.2  Linesur  System  Solution  by  LU  Decom¬ 
position 

The  next  example  demonstrates  solution  of  a  linear  sys¬ 
tem  by  LU  decomposition  with  partial  pivoting  and 
back  solving.  Given  an  m  x  m  matrix  A,  to  compute 
upper  triangular  matrix  U  and  lower  triangular  matrix 
L  such  that  A  ^  LU,  we  perform  m  steps  of  Gaussian 
elimination.  Moving  left  to  right  through  the  matrix,  we 
extract  the  ctdumns  in  sequence.  We  find  the  element 
with  the  greatest  absolute  value,  and  extract  its  row. 
We  then  divide  the  pivot  column  by  the  pivot  element 
itself,  and  distribute  the  pivot  row  and  column  across 
the  matrix.  We  replace  the  part  of  the  pivot  column  be¬ 
low  the  pivot  element,  where  it  will  serve  as  a  column 
of  L.  Finally,  we  perform  Gaussian  elimination  on  the 
square  below  and  to  the  right  of  the  pivot  row  and  col¬ 
umn.  Rather  than  physically  exchanging  the  two  rows, 
which  would  yield  a  true  lower  and  upper  triangular 
matrix,  we  record  in  a  separate  vector  which  row  was 
eliminated  in  which  iteration  and  use  this  information 
to  find  the  diagonal  in  the  solution  phase. 

Algorithm  LU  Deeompoeition  (LUD) 
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and  the  sums  of  the  elementwise  multiplications  are 
taken  along  the  rows  to  yield  the  final  result. 

One  complication  is  that  our  algorithm  accepts  a  row 
vector  as  input  but  returns  a  column  vector.  This  condi¬ 
tion  causes  less  of  a  problem  if  we  want  to  immediately 
use  the  vector  in  another  matrix-vector  multiplication. 


;matrir  A  (m  x  m) 

;LU  decomposition  with  partial  pivoting 
'.Returns  LU  in  A,  permutation  vector  in  P 

1  For  i  from  1  to  m  Do 

2  selecting  rows  and  columns  of  A 

that  have  not  been  pivoted  on 

3  Col  s  cxtract-column(A,  i) 

4  pivotrow  =  row  #  of  g-max(p-abe) 

5  Row  s  extract-row(  A,  pivotrow) 

6  P\pivotTow\  s  I 

7  pivotelement  s  A[i,pivotroii)] 

8  Col  w  p-i-(Col,  Broadcatt(piootelement)) 

9  Aw  da4>osit-oolumn(  A,  Col,  •) 

10  Colmetn's  w  diBtrilNite-cotiimn(Col) 

11  Aowmatris  w  diBtribute-row(JZ^) 

12  selecting  positions  in  A  that  are  not 

in  the  pivot  row  or  column 


13  A  =  P-— (i4,  p-*(Aouimatriz,Co/ma(riz)) 

The  forward  and  back  aolution  phase  is  also  straight¬ 
forward.  Given  a  vector  6,  we  first  solve  Lf  s  i,  divide 
y  and  U  through  by  the  diagonal  snch  that  the  system 
has  a  unit  diagonal,  and  then  solve  Uz  =  A  solution 
step  consists  of  extracting  a  column,  multiplying  it  by 
the  diagonal  element,  and  subtracting  from  6. 

Algorithm  Soltte  Linear  Syetem  from  Deeompoeition 

;ma(rix  LU  (m  x  m) 

; vector  B 

;permutation  vector  P 
;Solve  LU x  =  B 

1  For  i  from  1  to  m  Do 

2  Column  —  extract-column(.Lf/,  t) 

3  pivot  =  such  that  i  =  p[k] 

4  selecting  the  elements  B\J\  and  Column\j\ 

such  that  P[j]  >  i 

5  B  =  p— (J9,p-*(Co{«mn,Broadcatt(pi«o<))) 

6  send  logical  diagonal  of  LU  to  first  col  of  Temp 

7  Diag  =  extract-column(Temp,  1) 

■,At  thit  point  B  containe  y  from  Ly  =  b 
;  Divide  U  and  B  by  the  diagonal. 

8  B  =  p-~(B ,  Diag) 

9  Temp  s  di$tribute-coiumn(Diay) 

10  Selecting  positions  in  the  logical 

upper  triangle  of  LU 

11  LU  ^p-i-{LU,Diag) 

12  For  i  from  m  downto  IDo 

13  Column  =  extract-column(I17,  i) 

14  pivot  =  D[k]  such  that  P[k]  =  i 

15  selecting  the  elements  B[J]  and  Column\j\ 

such  that  P[j]  <  i 

16  B  —  p--;-(D,  p-*(Cofttmn,  Broadcast(pivo())) 

17  unpermute  B 

2.3  Simplex  method  for  linear  program¬ 
ming 

Our  final  example  illustrates  the  Simplex  method  for 
solving  linear  programming  problems.  The  standard 
form  of  a  linear  programming  problem  is  as  follows: 

...  T  j.  »i.  •  /  ~  ® 

mimituze  c  z  such  that  <  ^  n 

^  z  >  0 

where  c  is  an  m^-dimensional  integer  objective  function 
vector,  A  is  an  mi  x  mj  integer  constraint  motrtz,  b  is 
an  mi -vector  of  integers,  and  z  is  a  real  mj-vector  of 
unknowns.  Generally  we  have  mi  <  mj. 

A  vector  z  such  that  Ax  ^  h  and  z  >  0  is  called  a 
feaeMe  eolation  because  it  satisfies  all  the  constraints. 
If  a  linear  program  has  an  optimal  solution,  we  can  al¬ 
ways  find  one  such  that  mi  of  the  entries  in  vector  z  are 


equal  to  0  [IS].  Snch  vectors,  called  haste  feaeMe  solu¬ 
tion*,  correspond  geonetrically  to  comers  of  the  convex 
(ms — mi  )-dimmsionnl  polyt<^  of  all  feasible  solutions. 
The  simplex  method  Car  solving  Knear  programs  due  to 
Dantsig  [5]  starts  at  a  basic  feasible  solntion  and  pivot* 
to  a  new  l^c  feasible  solntion  which  improves  the  ob¬ 
jective  function.  Algebraically,  we  increase  one  of  the 
sero-valned  nonhastc  variables  (the  entering  variable) 
nntil  one  of  the  non-sero  basic  variables  becomes  zero. 
In  the  Dantzig  method  of  pivoting,  the  entering  variable 
is  the  one  that  will  decrease  the  objective  function  by 
the  most  (per  nnit  increase  in  the  variable). 

All  the  information  necessary  to  perform  the  pivot¬ 
ing  is  kept  in  a  tableau  where  the  objective  function  and 
all  nonbasic  variaUes  are  represented  in  terms  of  the 
basic  variables.  Since  ctdnmns  corresponding  to  the  ba¬ 
sic  variables  always  form  the  identity  matrix,  we  save 
space  by  not  representing  these  columns  (the  code  in 
Section  1  does  not  indnde  this  optimization).  At  the 
start,  the  tableau  is  the  constraint  matrix  A  augmented 
by  the  colnmn  vector  b  and  the  row  vector  c.  Vectors  6 
and  c  are  also  maintained  in  vector  representation.  We 
then  use  Ganssian  elimination  to  eliminate  all  columns 
corresponding  to  basic  variables.  To  form  the  tableau 
for  which  one  basic  variable  is  replaced  by  a  nonbasic 
variable  then  involves  one  step  of  Ganssian  elimination. 

The  implementation  of  Simplex  with  Dantzig’s  rule 
is  fairly  straightforward.  We  first  find  the  index  of  the 
most  negative  coeffident  of  the  objective  function;  piv¬ 
oting  on  this  variable  will  give  ns  the  most  rapid  im¬ 
provement  in  the  solntion  per  nnit  increase  in  the  en¬ 
tering  nonbasic  variable.  If  there  ate  no  negative  coeffi- 
dents,  then  we  cannot  make  any  improvement,  and  thus 
have  finished  snccessfnlly.  We  then  extract  the  indexed 
column,  and  select  the  positions  corresponding  to  real 
constraints,  L  e.  only  positive  coeffidents  correspond 
to  basic  variables  that  decrease  as  the  entering  variable 
increases.  If  there  are  no  positive  coeffidents  in  the  col- 
nmn,  then  the  system  is  nnbonnded;  we  can  increase 
the  value  of  variable  improving  the  objective  function 
without  limit  and  never  violate  a  constraint.  To  find 
the  limiting  constraint,  we  divide  the  6  vector  by  the 
positive  elements  of  the  pivot  column  dementwise  and 
find  the  index  of  the  smallest  ratio.  The  two  indices 
define  the  pivot  dement.  We  then  perform  a  Gaussian 
elimination  step. 

Algorithm  Simplex  (space-saving) 

;tableau  A  ((mi  -f  1)  x  (m*  -f  1)) 
i  constraint  vector  B 
;  objective  function  vector  C 

repeat  forever: 

1  pivotcolnam  s  ool  #  of  eiemeat 

holding  g-min(C) 

(if  g-min(C)  >  0,  exit  Simplex  successfully) 

2  Pivotcol  scxtrsct-column(A,p««otcofnnm) 


3  selecting  positions  in  Pivoted  with  vslues  >  0 

(if  none,  exit  Simplex  unsnccessfully) 

4  Ratio  =  p— r(B,  Pivotcolamn) 

5  pivotroumam  =  row  of  element 

holding  g-min(Aa(to) 

6  Pivotrow  =  cxtract-row(i4,pivoirottm«m) 

•yupdate  pivot  row  and  column 

7  pivotelement  =  y4[pivo<co/niim][pieotrotim«m] 

8  Pivotrow  =  p-T-(  Pivotrow,  Bro»<lca%t(pivotelement)) 

9  Rowmatriz  =  spread-row(i’ivo(roui) 

10  Colmatriz  =  spread-column(Pieoico/) 

;  update  constraint  vector  and  objective  function 
;on  their  own,  even  though  updated  later 

11  value  =  A[m,  n] 

12  B  =  p~—(B,p-*(Pivotcolumn,Bro»dc»tt(value))) 

13  C  =  [>-*(Pivotcolumn,  value)) 

;  Update  the  tableau 

14  =  wsert-ro\M(  A,  Pivotrow, pivotrownum) 

15  selecting  positions  of  A  that 

are  not  part  of  pivot  row  or  column 

16  =  p-— (y4,  p-*(Pivotrow,  Pivotcolumn)) 

3  Implementation  of  Primitives 

In  this  section  we  present  efficient  implementations  for 
the  four  vector-matrix  primitives  on  hypercube  archi¬ 
tectures  and  analyze  the  time  and  processor-time  com¬ 
plexities  of  these  implementations.  The  implementa¬ 
tions  are  based  on  a  particular  embedding  of  matrices 
and  vectors  on  the  hypercube.  We  show  that  both  the 
time  complexity  and  the  processor-time  complexity  of 
the  reduce  primitives  are  within  a  constant  factor  of  op¬ 
timal  provided  that  the  matrix  is  sufficiently  large  rel¬ 
ative  to  the  number  of  processors.  Finally,  we  discuaa 
generalizations  to  higher  dimensional  matrices,  matri¬ 
ces  whose  dimensions  are  not  powers  of  2,  and  matrices 
where  rows  or  columns  are  favored. 

Figure  2  shows  how  we  map  an  mj  xmj  matrix  on  an 
N-processor  hypercube.  Each  processor  holds  a  it  x  hj 
submatrix  which  is  as  square  as  possible.  The  proces¬ 
sors  can  themselves  be  viewed  in  a  grid  (perhaps  in 
row-major  order  with  respect  to  hyperenbe  addresses), 
but  we  use  the  full  power  of  the  hypercube  connec¬ 
tions  in  our  implementation.  The  processors  holding 
a  given  row  of  the  matrix  form  a  Igpj-dimensional  sub- 
cube  of  the  n-dimensional  hypercube  and  the  given  row 
is  mapped  to  the  same  submatrix  row  in  each  processor. 
To  map  a  length  2”  vector  onto  an  n-dimensional  hyper¬ 
cube  configured  as  a  pi  x  grid,  we  map  2*~"  vector 
elements  to  each  processor  with  row  vectors  mapped  in 
column-major  order  and  column  vectors  in  row-major 
order.  Figure  4  illustrates  how  row  vectors  are  mapped 
onto  the  virtual  grid  both  when  v  >  n  and  when  v  <  n. 
The  embedding  of  vectors  as  described  above  is  load 
bdanced  in  that  each  processor  holds  an  equal  number 
of  elements. 

Based  upon  the  matrix  and  vector  embeddings  de¬ 


scribed  above,  we  describe  the  implementation  of  the 
row  version  of  the  extract  primitive.  The  imidementa- 
tion  of  the  other  primitives  are  similar  and  are  pictured 
in  Figures  5,  6,  and  7.  The  deposit  primitive  is  basi¬ 
cally  the  inverse  of  the  extract  primitive.  The  reduce 
primitive  is  similar  to  the  extract  primitive  but  as  well 
as  swapping  data  across  the  cube  to  load  balance,  the 
data  is  snmmed  along  the  way.  The  distribute  primitive 
is  basically  the  inverse  of  the  reduce  primitive. 

All  of  the  primitives  are  implemented  by  stepping 
through  the  dimensioas  of  the  hypercube  with  each  pro¬ 
cessor  simultaneously  communicating  with  its  neighbor 
in  that  dimension.  In  the  extract  and  reduce  primitives, 
the  number  of  data  elements  communicated  typically 
halves  on  each  dimension  step  (each  step  halves  the 
complexity).  In  the  deposit  and  distribute  primitives, 
the  number  of  data  elements  conununicated  typically 
doubles  on  each  dimension  step.  For  the  extract  and 
deposit  primitives,  only  one  processor  from  each  pair  of 
communicating  processors  sends  data.  For  the  .educe 
and  distribute  primitives,  both  neighbors  generally  send 
equal  amounts  of  data.  None  of  the  primitives  require 
indirect  addressing  on  the  processors,  and  the  control 
is  strictly  data  parallel. 

3.1  Extract 

The  extract  operation  takes  a  matrix  and  an  index  r 
within  the  bounds  of  the  matrix  and  extracts  row  r  as 
a  tow  vector  (see  Figure  4).  In  the  extract  procedure, 
we  step  throng  the  dimensions,  starting  at  the  highest 
dimension,  and  each  processor  containing  elements  of  r 
sends  half  of  its  elements  to  its  empty  neighbor  in  that 
dimension.  The  result  of  the  extract  is  that  the  vector 
elements  originally  held  in  the  processor  at  grid  address 
(r^yj)  are  evenly  distributed  among  the  processors  in 
column  j. 

Algorithm  Extract-Row 
•,matrix 

;row  number  r  (processor  row  rp,  offset  t„) 

1  Let  ro,  ri, . . .  ri,^j_i  be  the  bit  representation 
of  processor  row  rp. 

2For  i  =  0  to  Igpi  —  1  Do 

3  If  no  processor  has  >  1  element  of  row  r 

then  stop. 

4  Selecting  processors  with  elements  of  row  r 

5  if  ri  s  0 

6  then  send  last  kj/2"*'*  elements 

of  row  r  to  neighbor  in  dimension  i. 

7  else  send  first  elements 

of  row  r  to  neighbor  in  dimension  i. 

The  second  if  statement  guarantees  that  the  row  is  kept 
in  a  fixed  order.  Figure  4a  illustrates  the  case  where  the 
number  of  elements  per  row  in  each  processor  is  greater 


Figure  4;  Examples  of  the  algorithm  cxtract-row.  The 
grid  indicates  the  grid  of  processors.  Number  ranges 
indicate  elements  of  the  row  vector  being  extracted.  At 
the  start,  all  elements  are  in  a  single  row  of  processors. 
In  case  (a),  each  processor  ends  up  with  more  than  one 
vector  element.  In  case  (b),  some  processor  rows  do  not 
get  any  vector  elements.  In  both  cases,  the  final  vector 
is  in  column-major  order. 


Figure  5:  The  deposit-row  primitive  deposits  a  row  vec¬ 
tor  into  a  tow  of  a  matrix.  The  algorithm  is  basically 
the  inverse  of  the  cxtract-row  algorithm. 


than  the  number  of  processors  in  each  column  of  the 
processor  grid.  In  this  case  the  row  elements  ate  dis¬ 
persed  in  column-major  order  sdth  each  processor  get¬ 
ting  the  same  number  of  matrix  elements.  Figure  4b  il¬ 
lustrates  the  case  where  the  number  of  row  elements  per 
processor  is  less  than  the  number  of  processors  in  a  col¬ 
umn  of  the  processor  grid.  In  this  case,  each  processor 
has  at  most  one  matrix  element.  Which  rows  are  empty 
depends  upon  which  row  is  extracted.  If  no  exchanges 
are  performed  across  the  last  j  dimensions,  then  only 
processors  that  match  the  extracted  row  in  the  last  j 
bits  of  column  address  will  contain  vector  elements.  A 
modified  version  can  put  the  vector  in  canonical  form 
with  at  most  Igpi  extra  steps. 

We  now  count  the  number  of  messages  sent  where  a 
message  is  the  sending  of  one  matrix  value  by  each  ac¬ 
tive  processor.  During  the  first  communication  phase, 
each  active  processor  sends  ki/2  messages,  then  kj/A 
messages,  then  hj/8  and  so  on.  Each  time  a  message 
is  sent,  each  sending  processor  sends  one  element  to  a 
neighbor  which  has  fewer  elements.  Thus,  each  message 
phase  reduces  the  maximum  number  of  elements  per 
processor  by  one.  Therefore,  the  total  number  of  mes¬ 
sages  sent  is  equal  to  the  redaction  in  maximum  num¬ 
ber  of  elements  per  processor.  Since  at  moot 
elements  of  r  are  left  in  any  processor,  the  number  of 
messages  sent  is  kj  —  fij/pi]. 

Using  similar  arguments  the  message  complexities 
of  deposit  and  distribute  are  at  most  kj  -  (kj/pi]  -f 
IgPi,  respectively.  The  reduce 

primitive  has  the  same  message  complexity  as  distribute 
and  it  has  an  operation  complexity  of  kjki  —  \kilp\\ 


Figure  6:  The  reduce-to-row  operation  focusing  on  one 
column  of  processors.  In  the  example  we  are  reducing  7 
columns.  The  ranges  indicate  the  processors  which  con¬ 
tain  the  partial  sums  of  the  given  indices.  In  each  step 
of  the  ruducs-to-row  operation,  each  processor  sends  half 
its  elements  to  its  neighbor  in  the  ith  dimension  and  ac¬ 
cumulates  what  it  receives  into  the  half  it  keeps.  The 
process  terminates  when  all  dimensions  are  processed, 
(a)  If  processors  contain  >  1  final  sum,  then  all  pro¬ 
cessors  have  the  same  number  of  sunu.  (b)  If  all  di¬ 
mensions  have  not  yet  been  crossed  but  processors  have 
at  moot  1  partial  s«n,  pwoensoes  with  higher  addresses 
send  their  partial  snm. 


Figure  7:  The  di»tributc-row  primitive  replic&te«  »  row 
vector  ftcroM  a  matrix.  Communication  proceeda  acroaa 
dimensions  of  the  processor  row  address  least  signifi¬ 
cant  bit  to  most  significant  bit.  Each  Processor  sends 
all  information  to  its  neighbor  and  receives  an  equiva¬ 
lent  amount  of  information,  thus  doubling  data  on  each 
step.  In  case  (b)  where  not  all  processors  hold  vector 
elements,  the  first  steps  replicate  single  values  until  all 
processors  have  one  vector  element.  Then  the  number 
of  values  sent  doubles  on  each  successive  step. 


ls(rPi/^2l)  processors  must  accumulate  a  local 

subtotal  before  cube  swapping. 


3.2  Analysis 

In  this  section  we  argue  that  our  algorithms  for  the 
reduce-to-row  operations  is  within  a  constant  factor  of 
optimal  in  two  ways.  First  we  show  that  the  paral¬ 
lel  time  (messages  plus  computation)  required  to  exe¬ 
cute  algorithms  reduce-to-row  is  within  a  constant  fac¬ 
tor  of  optimal.  This  is  also  true  of  the  distribute-row 
implementation  provided  the  embedding  is  nontrivial. 
Then  we  argue  that  the  total  work  which  is  the  prod¬ 
uct  of  parallel  time  (T)  and  number  of  processors  (P) 
is  within  a  constant  factor  of  the  work  required  for  any 
sequential  implementation  of  the  reduce  operation  pro¬ 
vided  ktkj  >  Igpi.  In  other  words,  our  algorithm  is 
optimal  if  the  number  of  matrix  elements  per  proces¬ 
sor  is  at  least  as  large  as  the  number  of  dimensions  of 
the  hypercube  (in  fact,  a  constant  fraction  smaller  than 
Ig  n  is  also  sufficient).  To  argue  optimality  in  terms  of 
time,  we  state  lower  bounds  for  the  number  of  messages 
and  operations  required  by  any  parallel  implementation 
of  reduce-to-row  and  compare  them  to  the  operation 
counts  argued  earlier  in  this  section.  We  then  compare 
minimum  number  of  operations  with  the  PT  product 
of  our  implementation  of  reduce-to-row.  To  achieve  the 
lower  bounds  we  use  a  model  where  each  processor  can 
only  send  out  one  message  in  unit  time. 


S.2.1  Parallel  Operation  Count 

To  prove  that  our  implementation  of  reduce-to-row  per¬ 
forms  at  most  a  oonstaat  factor  more  than  the  opti¬ 
mal  number  of  parallel  operations,  we  first  prove  that 
any  parallel  algorithm  for  raducc-to-row  must  send  at 
least  n(lgp)  messages  regardless  of  how  the  matrix  el¬ 
ements  are  embedded  in  the  hyperenbe.  To  complete 
the  optimality  proof  for  reduce-to-row,  we  finally  show 
a  lower  bound  of  fij  parallel  arithmetic  operations  on 
an  W  w  ptpg  processor  machine. 

We  now  argne  that  any  algorithm  that  computes 
rsduce-to-row  must  send  n(Igp)  messages.  Let  ns  as¬ 
sume  that  we  have  m  >  p/2,  because  otherwise  we  could 
run  our  algorithm  on  a  snbeube  of  the  machine.  Also, 
let  us  assume  without  loss  of  generality  that  mi  >  m^, 
so  that  m.  >  p/4.  Suppose  some  processor  contains 
n(]g  p)  elements  from  some  column.  Then  it  must  accu- 
mnlate  at  least  half  of  them  locally  or  send  at  least  half 
of  the  elements  to  other  processors,  thus  yielding  the 
lower  bound.  Suppose,  instead  that  no  processor  con¬ 
tains  more  than  Igp  elements  of  any  one  column.  Then 
some  column  must  be  embedded  in  at  least  mi  /  Ig  p  pro¬ 
cessors.  Therefore,  to  accumulate  the  c<dnmn  requires 
Igmi  -  Igigp  >  lg(p/4)  -  Iglgpi  messages.  Therefore, 
we  must  send  O(lgp)  messages. 

We  now  count  the  minimum  number  of  parallel 
arithmetic  steps  on  machine  with  N  =‘  pipj  proces¬ 
sors.  The  number  of  arithmetic  operations  that  must 
be  performed  is  m2(mi  -  1).  Since  there  are  only  pips 
processors  to  perform  that  operation,  we  must  use  at 
least  (mimj  --  mj)/pipa  w  n(kik})  time. 

Combining  the  separate  lower  bounds,  we  have 
that  the  minimum  time  to  perform  a  reducs-to-row  is 
O(lgp-l-kikj).  Since  the  number  of  messages  plus  oper¬ 
ations  executed  by  our  implementation  of  rsduce-to-row 
is  0(lgp-f  kikj),  our  implementation  is  optimal  in  terms 
of  asymptotic  parallel  execution  time. 

We  can  use  arguments  similar  to  those  above  to  show 
an  ()(kt  +  Igp)  lower  bound  on  message  complexity  for 
the  distribute  operation  provided  we  require  a  processor 
holding  e  elements  of  a  column  to  record  e  local  copies. 
If  the  processors  need  only  have  one  copy  for  each  col¬ 
umn  of  which  it  contains  any  elements,  then  we  can 
show  the  lower  bound  provided  the  matrix  is  embedded 
nontrivially.  By  nontrivial,  we  mean  that  no  one  proces¬ 
sor  can  contain  more  than  (1  —  l/p)m  of  the  m  matrix 
elements.  The  bask  idea  is  that  if  a  processor  contains 
all  or  most  of  a  column,  then  distributing  within  that 
column  is  trivial.  It  then,  however,  costs  that  processor 
considerably  to  distribute  along  rows. 

3.2.2  Processor-Time  Product 

In  this  section  we  show  that  the  product  of  arithmetic 
operations  times  number  of  processors  is  within  a  con¬ 
stant  factor  of  the  number  of  operations  required  by 
any  sequential  reduce-to-row  implementation.  As  ar¬ 
gued  above,  any  sequential  algorithm  must  perform 


mj(mi  —  1)  uithmetic  operAtion*. 

The  FT  product  for  rcducu-to-row  ia 

PT  =  l»ipa(*i*j +Ig(fpi/tj1)  -  f*j/pi1) 

XT  mimj +pip2lg(fp]/Jl3'{) -pipa 
<  mini] +pip2fcih]  —  piP3  ('h2/pi] 

because  we  aMume  that  kikt  >  Igpi.  Therefore  we 
have  that 

FT  <  2mim2 -pipi  fki/pil 

<  3mi  m]  —  3m]. 

This  proves  that  the  total  work  as  indicated  by  the 
processor- time  product  is  within  a  constant  factor  of 
optimal  provided  that  kikj  >  Igpi- 

3.3  Computing  on  a  Single  Row  or  Column 

In  this  subsection  we  discuss  the  cost  of  computing  on  a 
single  extracted  row  of  a  matrix.  In  particular  we  con¬ 
sider  whether  it  is  always  worth  load  balancing  rather 
than  just  moving  the  row,  or  column,  locally  within  the 
processors  that  contain  it.  We  consider  four  cases  that 
might  appear  in  practice; 

1.  We  extract  a  row,  operate  on  it,  and  distribute  it 
across  the  other  rows. 

2.  We  extract  a  row,  operate  on  it,  and  deposit  it 
back  in  place. 

3.  We  extract  a  row,  operate  on  it,  and  deposit  it 
back  into  another  row. 

4.  We  extract  a  row,  operate  on  it,  and  throw  the 
row  away  (for  example,  if  we  wanted  to  find  the 
maximum  in  a  row). 

The  load  balancing  advantage  costs  nothing  in  the 
first  case.  Since  spreading  a  row  across  to  all  others 
is  efficiently  implemented  as  an  extract  followed  by  a 
distribute,  we  simply  break  the  spread  into  its  two  pieces 
and  operate  on  the  load-balanced  representation  in  the 
middle.  The  decision  of  whether  or  not  to  load  balance 
the  vector  in  the  second  and  fourth  cases  should  be  a 
compile-time  or  run-time  decision,  and  will  depend  on 
bow  many  operations  need  to  be  performed,  and  the 
relative  time  of  communication  and  computation.  We 
have  found  in  the  applications  we  have  studied  that  the 
first  case  occurs  frequently,  so  it  is  therefore  often  worth 
load  balancing  when  extracting  a  row.  We  analyse  the 
second  case  to  give  an  example  of  the  considerations 
required  for  deciding  whether  it  is  worth  load  balancing 
or  not. 

In  the  second  case  we  want  to  extract  a  row,  oper¬ 
ate  on  it,  and  put  it  back.  Let  a  be  the  time  it  takes  to 
perform  a  single  arithmetic  operation  of  interest  (e.g.  a 
divide)  and  let  s  be  the  time  to  send  a  matrix  element 


over  a  hypercube  wire.  Let  «  m  fk]/pil  be  the  maxi¬ 
mum  number  of  elements  per  prooensot  after  load  bal¬ 
ancing.  Then  the  time  it  takes  to  compute  without  load 
balancing  is  kje.  If  we  extract  first,  compute  and  run 
extract  ia  reverse,  the  cost  of  the  entire  oompatation  is 
3(k]  —  f)s  tor  the  messages  pins  fs  for  the  computation. 
The  load  balancing  is  advantageous  exactly  when 


k]a 

> 

2(kj  -  «)s  ■+■  qa 

(*a  -  f  )* 

> 

2{ka  -  q)$ 

a 

> 

2s. 

Tkns  a  compiler  eeed 

only 

estimate  the  amount  of  time 

to  perform  the  arithmetic  and  the  amount  of  time  to 
send  a  matrix  element  over  a  cube  wire.  If  the  com¬ 
piler  determines  that  it  is  not  advantageous  to  do  load 
balancing  in  such  a  situation,  it  can  do  a  lory  extract 
which  simply  copies  the  row  to  a  local  array.  If  we  want 
to  store  the  result  vector  back  to  another  row  of  the 
matrix,  the  decision  is  not  entirely  configuration  inde¬ 
pendent  since  we  must  figure  the  cost  of  sending  a  row 
directly  (worst  case  Igpi)  vs.  the  cost  of  an  extract  plus 
deposit  (worst  case  2k]  ■t-lgpi),  making  load  balancing 
leas  favorable.  In  this  analysis  we  assumed  there  is  no 
pipeline  startup  cost  for  executing  multiple  arithmetic 
or  communication  steps.  With  a  pipeline  start  up  cost, 
the  decision  could  not  be  made  at  compile  time. 

3.4  Extensions 

In  this  subsection,  we  discuss  extensions  to  higher  di¬ 
mensional  matrices,  matrices  whose  dimensions  are  not 
powers  of  2,  and  matrices  where  rows  or  columns  are 
favored.  We  also  discum  ways  to  represent  vectors  in  a 
canonical  so  that  we  no  longer  distinguish  between  row 
and  column  vectors. 

If  we  have  t-dimensional  matrices,  the  addrem  of 
each  matrix  element  is  now  divided  into  (  pieces  cor¬ 
responding  to  indices  in  each  of  the  dimensions.  All 
implementations  will  extend  directly  by  operating  on 
the  appropriate  set  of  addrem  bits.  For  example,  to  ex¬ 
tract  ia  dimensions  1  and  2,  use  the  concatenated  bits 
of  the  processor  addresses  for  these  dimensions  in  place 
of  the  row  addrem  in  algorithm  extract. 

For  matriem  where  mi  and  ms  are  not  powers  of 
2,  we  embed  the  matrix  in  a  pi  x  pa  matrix  such  that 
each  processor  has  at  most  ki  s  ["ti/Fil  *^<1 
k]  E  fma/p]]  columns.  If  either  k|  or  k]  is  not  a  power 
of  2,  the  first  communication  phase  of  an  extract  and 
the  last  communication  phase  of  a  deposit,  etc,  will  have 
fewer  messages  than  normal,  namely  [k^/2J.  Otherwise, 
the  implementations  are  unchanged. 

If  we  wish  to  optimise  operations  on  rows,  taking 
a  penalty  for  operatioas  on  columns,  we  can  configure 
the  processor  grid  such  each  processor  holds  a  mini¬ 
mum  number  of  elements  from  the  same  row  (at  most 

r»»a/pl). 

Another  possible  extension  is  to  store  vectors  in  a 
canonical  form.  In  our  current  implementation  a  vector 


extracted  fiom  a  row  of  a  matrix  (a  row  vector)  will 
be  ordered  on  the  proceasora  differently  from  a  vector 
extracted  from  a  column  of  a  matrix  (a  column  vec¬ 
tor).  It  is  poasible  to  store  all  vectors  in  the  same  or¬ 
dering,  a  canonical  form,  with  no  additional  cost  to  our 
primitives.  This,  however,  requires  that  the  rows  and 
columns  of  a  matrix  are  mapped  onto  the  processors  in 
an  noncontiguous  order — one  row  of  a  matrix  will  no 
longer  be  adjacent  to  the  next.  This  will  make  nearest 
neighbor  communication  on  a  grid  more  expensive.  We 
believe  that  the  better  choice  is  to  keep  the  two  rep¬ 
resentations  aiid  swap  between  them  when  necessary. 
This  could  be  hidden  from  the  user. 


4  Timings 

We  present  the  timings  for  each  of  the  primitives  on 
the  CM-2  along  with  the  timings  for  the  vector  matrix 
multiply  and  Simplex  algorithms.  The  timings  are  for 
a  square  matrix  in  a  16384  processor  machine  running 
at  6.7  MHz.  Timings  reflect  the  total  time  that  the 
CM-2  was  busy.  Each  element  of  the  matrix  is  a  sin¬ 
gle  precision  (32  bit)  floating  point  number.  Timings 
for  the  full  65,536  processor  machine  are  extrapolated 
from  the  figures  for  the  smaller  machine.  We  give  the 
timings  for  different  values  of  ki  and  kj,  the  intrapro- 
cessoT  matrix  dimensions.  The  times  for  more  than  1 
element  per  processor  scale  sublinearly  with  the  total 
number  of  elements  in  each  processor.  The  times  are 
presented  in  milliseconds,  and  the  flop  rate  in  Mflops. 
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We  now  provide  an  example  of  performance  improve¬ 
ment  from  use  of  these  primitives.  A  carefully  coded 
naive  implementation  of  Simplex,  using  kj  =  8  and 
k}  s  8  ran  at  a  speed  of  190  milliseconds  per  iteration, 
which  is  equivalent  to  44  Mflops.  The  version  using 
the  primitives  discussed  in  this  paper  takes  1?  ms  per 
iteration,  which  is  equivalent  to  about  500  Mflops. 

The  distributs>row  implementation  used  for  the 
matrix- vector  multiply  timing  propagated  row  values 
through  the  inttaprocessor  matrix.  Performance  can  be 
improved  by  propagating  only  one  copy  of  each  value 
per  processor.  We  estimate  the  time  spent  distributing 
the  values  across  the  intraprocessor  matrix  to  be  as  high 
as  85%  of  the  overhead  for  16  x  16  intraprocessor  ma¬ 
trices.  Hence,  we  could  gain  considerable  performance 
by  implementing  this  optimization. 


^5  Conclusions 

In  this  paper  we  discuss  a  set  of  powerful  primitive  ma¬ 
trix  operations  whic^^ow  easy  spedflcation  of  parallel 
matrix  routines.  demonstrate Va  ouf  hypercube  im¬ 

plementation  that  the  additional  expressive  power  need 
not  reduce  performance  and  can,  in  (act,  improve  per¬ 
formance  by  providing  automatic  load  balancing  in  the 
case  where  there  are  more  matrix  elements  than  proces¬ 
sors.  We  deseiibe'^some  routines  based  on  these  primi¬ 
tives  and  other  simple  parallel  operations  an<f  give  some 
timings  for  the  primitives  and  routines  for  our  imple¬ 
mentation  on  the  Connection  Machine. 

In  the  future  we  hope  to  generalize  out  implementa¬ 
tion' of  the  primitives  so  they  work  on  processor  grids 
whose  row  and  column  sizes  are  not  powers  of  two,  and 
to  allow  a  vector  extracted  from  a  row  of  a  matrix  to  be 
distributed  to  or  deposited  in  either  a  row  or  column  of 
another  matrix.  WejJso  hope  to  make  oUk  primitives 
available  to  higher  level  languages  so  that  they  can  be 
easily  used. 


We  hope  that  thii  paper  epnit  intereat  in  developing 
n  small  set  of  simple  matrix  primitives  which  conld  then 
be  efficiently  implemented  with  a  consistent  interface  on 
a  large  number  of  machines. 
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